Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


#3. Combine files into one, sort them numerically, and collapse redundant entries

sort -n temp.lines1 | uniq > temp.lines

rm temp.lines1

outfq1=$(echo "$fq_r1" | cut -d'.' -f 1)

outfq2=$(echo "$fq_r2" | cut -d'.' -f 1)

#4. Remove the line numbers recorded in temp.lines from both FASTQ files

awk 'NR==FNR{l[$0];next;} !(FNR in l)' \
    temp.lines "$fq_r1" \
    > "$outfq1-$minLength.fastq"

awk 'NR==FNR{l[$0];next;} !(FNR in l)' \
    temp.lines "$fq_r2" \
    > "$outfq2-$minLength.fastq"

gzip $outfq1-$minLength.fastq

gzip $outfq2-$minLength.fastq

rm temp.lines
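The awk filter in step 4 can be tried on a small example before running it on real data. The following is a minimal sketch; the file names "toy.lines" and "toy.txt" and the variables "workdir" and "kept" are hypothetical and are not part of the script above.

```shell
# Toy demonstration of the awk line-filter idiom used in step 4.
# toy.lines and toy.txt are hypothetical file names used only here.
workdir=$(mktemp -d)
printf '%s\n' 2 3 > "$workdir/toy.lines"       # line numbers to drop
printf '%s\n' a b c d e > "$workdir/toy.txt"   # five-line input file

# While reading the first file (NR==FNR), store each line number as an
# array key. While reading the second file, print only the lines whose
# number (FNR) is not one of the stored keys.
kept=$(awk 'NR==FNR{l[$0];next} !(FNR in l)' \
    "$workdir/toy.lines" "$workdir/toy.txt")

echo "$kept"      # lines 2 and 3 are gone, so a, d, and e remain
rm -r "$workdir"
```

One caveat of this idiom: if the file of line numbers is empty, NR==FNR stays true while awk reads the second file, so nothing is printed. It may be worth checking that the list is not empty (for example with "[ -s temp.lines ]") before filtering.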

Once you have saved the file, you may need to make it executable with the Linux command “chmod”:

chmod +x remove_PE.sh

Then, run the following commands:

./remove_PE.sh ERR1823587_pure_R1.fastq ERR1823587_pure_R2.fastq 50

./remove_PE.sh ERR1823601_pure_R1.fastq ERR1823601_pure_R2.fastq 50

./remove_PE.sh ERR1823608_pure_R1.fastq ERR1823608_pure_R2.fastq 50
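The three commands differ only in the run accession, so they can also be written as a loop. The sketch below assumes "remove_PE.sh" is in the current directory; the function name "run_sample" and the "DRY_RUN" variable are illustrative additions, not part of the original script. By default the sketch only prints the commands it would run.

```shell
# Build and (optionally) run the remove_PE.sh command for one sample.
# run_sample and DRY_RUN are hypothetical names used for illustration.
run_sample() {
    sample=$1
    minlen=$2
    set -- ./remove_PE.sh "${sample}_pure_R1.fastq" "${sample}_pure_R2.fastq" "$minlen"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"     # dry run: only show the command
    else
        "$@"          # real run: execute the command
    fi
}

# Accessions taken from the commands above.
for acc in ERR1823587 ERR1823601 ERR1823608; do
    run_sample "$acc" 50
done
```

Setting DRY_RUN=0 before the loop executes the script for each sample instead of printing the command.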

At this point, the host sequences have been removed from the metagenomic data. The cleaned reads are stored in “ERR1823587_pure_R1-50.fastq.gz” and “ERR1823587_pure_R2-50.fastq.gz” for the sample of the healthy person, “ERR1823601_pure_R1-50.fastq.gz” and “ERR1823601_pure_R2-50.fastq.gz” for the moderate sickle cell patient, and “ERR1823608_pure_R1-50.fastq.gz” and “ERR1823608_pure_R2-50.fastq.gz” for the severe sickle cell patient. To save some storage space, you can delete the other FASTQ files using “rm *.fastq” and also delete all files in “fastqdir”.
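Before deleting the intermediate files, it may be worth confirming that each pair of output files still contains the same number of reads, since the script removes the same line numbers from both mates. The following is a sketch that assumes standard four-line FASTQ records; the function names "count_reads" and "check_pair" are hypothetical.

```shell
# Count reads in a gzipped FASTQ file, assuming four lines per record.
# count_reads and check_pair are hypothetical helper names.
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Compare the read counts of the two mates of a pair.
check_pair() {
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    if [ "$n1" = "$n2" ]; then
        echo "OK $n1"
    else
        echo "MISMATCH $n1 $n2"
        return 1
    fi
}

# Check each sample whose output files are present in this directory.
for acc in ERR1823587 ERR1823601 ERR1823608; do
    f1="${acc}_pure_R1-50.fastq.gz"
    f2="${acc}_pure_R2-50.fastq.gz"
    if [ -f "$f1" ] && [ -f "$f2" ]; then
        check_pair "$f1" "$f2"
    fi
done
```

Equal counts do not prove that the mates are still in the same order, but a mismatch is a clear sign that something went wrong during filtering.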

The metagenomic FASTQ files are stored in “fastq_pure” as shown in Figure 8.2. Above, we deleted the original FASTQ files from the “fastqdir” directory and the intermediate FASTQ files from the “fastq_pure” directory. You can also delete the SAM and BAM files from the “sam” directory and the reference sequences and indexes from the “ref” directory if you want to save storage space. However, you are advised to keep the reference genome files in “ref”, because you may need to repeat the steps and indexing usually takes a long time.

8.2.4 Assembly-Free Taxonomic Profiling

We can use the FASTQ files to perform taxonomic profiling without metagenome assembly. This approach employs NGS short or long reads present in the metagenomic samples to

assign taxonomic groups by identifying unique genomic regions in the reads. Long reads